Abstract

Distributed representations of words as real-valued vectors in a
relatively low-dimensional space aim at extracting syntactic and
semantic features from large text corpora. A recently introduced neural
network, named word2vec (Mikolov et al., 2013a; Mikolov et al., 2013b),
was shown to encode semantic information in the direction of the word
vectors. In this brief report, it is proposed to use the length of the
vectors, together with the term frequency, as a measure of word
significance in a corpus. Experimental evidence using a domain-specific
corpus of abstracts is presented to support this proposal. A useful
visualization technique for text corpora emerges, where words are mapped
onto a two-dimensional plane and automatically ranked by significance.

1 Introduction 

Discovering the underlying topics or discourses in large text corpora is
a challenging task in natural language processing (NLP). A statistical
approach often starts by determining the frequency of occurrence of
terms across the corpus, and using the term frequency as a criterion for
word significance, a thesis put forward in a seminal paper by Luhn
(Luhn, 1958). From the list of terms ranked by frequency, terms that are
either too rare or too common are usually dropped, for they are of
little use. For a domain-specific corpus, the top ranked terms in the
trimmed list often nicely summarize the main topics of the corpus, as
will be illustrated below.

For more detailed corpus analysis, such as discovering the subtopics
covered by the documents in the corpus, the term frequency list by
itself is, however, of limited use. The main problem is that within a
given frequency range, function words, which primarily have an
organizing function and carry little or no meaning, appear together with
content words, which represent central features of texts and carry the
meaning of the context. In other words, the rank of a term in the
frequency list is by itself not indicative of meaning (Luhn, 1958).

This problem can be tackled by replacing the corpus-wide term frequency
with a more refined weighting scheme based on document-specific term
frequency (Aizawa, 2000). In such a scheme, a document is taken as the
context in which a word appears. Since key words are typically repeated
in a document, they tend to cluster and to be less evenly distributed
across a text corpus than function words of the same frequency. The
fraction of documents containing a given term can then be used to
distinguish them. Much more elaborate statistical methods have been
developed to further explore the distribution of terms in collections of
documents, such as topic modeling (Blei et al., 2003) and spacing
statistics (Ortuno et al., 2002).

An even more refined weighting scheme is obtained by reducing the
context of a word from the document in which it appears to a window of
just a few words. Such a scheme is suggested by Harris’ distributional
hypothesis (Harris, 1954), which states “that it is possible to define a
linguistic structure solely in terms of the ‘distributions’ (= patterns
of co-occurrences) of its elements”, or as Firth famously put it
(Firth, 1957), “a word is characterized by the company it keeps”.

Word co-occurrence is at the heart of several machine learning
algorithms, including the recently introduced word2vec by Mikolov and
collaborators (Mikolov et al., 2013a; Mikolov et al., 2013b). Word2vec
is a neural network with a single hidden layer that uses word
co-occurrence for learning a relatively low-dimensional vector
representation of each word in a corpus, a so-called distributed
representation (Hinton, 1986). The dimension is typically chosen to be
of order 100 or 1000. This is easily orders of magnitude smaller than
the size of a vocabulary, which would be the dimension when a one-hot
representation of words is chosen instead. Given the words appearing in
a context, the neural network learns by predicting (the representation
of) the word in the middle, or vice versa. During training, words that
appear in similar contexts are grouped together in the same direction by
this unsupervised learning algorithm. The distributed representation
thus ultimately captures semantic similarities between words. This has
been impressively demonstrated by a series of experiments in the
original word2vec papers, where semantic similarity was measured by the
dot product between normalized vectors.

In this brief report, we consider the problem of identifying significant
terms that give information about content in text corpora made up of
short texts, such as abstracts of scientific papers, or news summaries.
It is proposed to use the L2 norm, or length of a word vector, in
combination with the term frequency, as a measure of word significance.

In a discussion forum dedicated to word2vec,^1 it has been argued by
some that the length of a vector merely reflects the frequency with
which a word appears in the corpus, while others argued that it in
addition reflects the similarity of the contexts in which a word
appears. According to this thesis, a word that is consistently used in a
similar context will be represented by a longer vector than a word of
the same frequency that is used in different contexts. Below, we provide
experimental support for this thesis. It is this property that justifies
measuring significance by word vector length, for words represented by
long vectors refer to a distinctive context.

It is further proposed that the scatter plot of word vector length
versus term frequency of all the words in the vocabulary provides a
useful two-dimensional visualization of a text corpus.

The paper is organized as follows. The next section introduces the
language corpus used and gives a global characterization based on term
frequency. Section 3 describes the experiments carried out using
word2vec, and presents the main results. Section 4 concludes the paper
with a short discussion.

^1 http://groups.google.com/forum/#!forum/word2vec-toolkit


2 Dataset 

For our experiments we use a dataset from the arXiv,^2 a repository of
scientific preprints. The dataset consists of about 29k papers from one
single subject class in the arXiv, viz. the hep-th section on
theoretical high-energy physics, posted in the period from January 1992
to April 2003. Although full papers are available, we consider only
title and abstract of the papers, which have about 100 word tokens on
average.^3

LaTeX (or TeX) commands are removed from input through use of the detex
program.^4 The input text is further converted to lowercase, and
punctuation marks and special symbols are separated from words, as was
done in the preprocessing step of a word2vec experiment by Mikolov on
the IMDB dataset of movie reviews.^5
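The lowercasing and punctuation-splitting step described above can be sketched as follows. This is a minimal illustration, not the script actually used: the function name `preprocess` and the regular expression are assumptions, and the rule applied here (every character that is neither a word character nor whitespace becomes its own token) may differ in detail from the original preprocessing.

```python
import re

def preprocess(text):
    """Lowercase the text and split punctuation/special symbols off words.

    Assumption: any character that is neither a word character nor
    whitespace is treated as a separate token.
    """
    text = text.lower()
    # Surround each punctuation mark or special symbol with spaces.
    text = re.sub(r"([^\w\s])", r" \1 ", text)
    return text.split()

if __name__ == "__main__":
    print(preprocess("We show that black holes, in 2+1 dimensions, evaporate."))
```

Note that this simple rule also splits hyphenated terms into separate tokens, which may or may not be desirable for a given corpus.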

2.1 Term Frequency List 

After removing stop words and punctuation marks, the list of 50 most
frequently used words in the corpus reduces to the one given in Table 1.
Deriving from a domain-specific dataset, this list indeed gives a
succinct and fairly precise characterization of the hep-th corpus, which
is primarily about “gauge theory”, “quantum field theories”, and “string
theory”. It correctly reveals the importance of “models” in this
research area, as well as the importance of the concepts “space”,
“solutions”, “action”, “dimensions”, “symmetry group”, “equations”, and
“algebra”. The term “black” refers to “black holes”, which play a
distinctive role in this corpus. The term “also” is not filtered out by
the NLTK^6 stop word list we use. Finally, “show” (used exclusively as
verb in this corpus, not as noun) appears mostly in the context “we show
that” and reveals that a large portion of the corpus consists of
research papers. Note that “show” is the only verb besides the stop word
“be” that made it into the top 50 list.
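A term frequency list of this kind can be sketched in a few lines. The stop word set below is a tiny hypothetical stand-in for the NLTK list used in the paper, and `top_terms` is an illustrative name; punctuation is dropped here by requiring at least one alphanumeric character per token, which is an assumption about the filtering rule.

```python
from collections import Counter

# Hypothetical miniature stop word list; the paper uses the NLTK list.
STOP_WORDS = {"the", "of", "in", "a", "and", "to", "we", "is", "for", "that"}

def top_terms(tokens, stop_words=STOP_WORDS, n=50):
    """Return the n most frequent tokens that are neither stop words
    nor pure punctuation, as (term, frequency) pairs."""
    counts = Counter(
        t for t in tokens
        if t not in stop_words and any(ch.isalnum() for ch in t)
    )
    return counts.most_common(n)

tokens = ("we show that the gauge theory of the string is a gauge theory , "
          "also a theory .").split()

if __name__ == "__main__":
    for term, tf in top_terms(tokens, n=5):
        print(term, tf)
```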


^2 http://arxiv.org/
^3 The dataset is available from the KDD Cup 2003 homepage
http://www.sigkdd.org/kdd-cup-2003-network-mining-and-usage-log-analysis
^4 http://www.cs.purdue.edu/homes/trinkle/detex/
^5 https://groups.google.com/forum/#!msg/word2vec-toolkit/Q49FIrNOQRo/J6KG8mUj45sJ
^6 www.nltk.org/




term            v       tf
theory        1.90    27702
field         2.00    17510
gauge         2.13    15536
string        2.33    13523
model         2.19    12389
quantum       2.14    12307
theories      2.22    10528
space         2.18     8035
also          1.67     7907
models        2.39     7313
two           2.03     7286
fields        2.11     7261
solutions     2.49     7129
show          1.87     7125
action        2.39     6602
one           1.82     6440
black         3.24     6011
dimensions    2.34     5953
symmetry      2.35     5792
group         2.46     5696
equations     2.54     5509
algebra       2.70     5461

Table 1: Top ranked words in the term frequency list of the hep-th
corpus with their vector length v (included for later convenience) and
term frequency tf. Punctuation marks and stop words are removed from the
list.


3 Experiments 

We next turn to word2vec.^7 For training the neural network, we use the
same parameter settings as advertised for the IMDB dataset referred to
above.^8 With these settings, the vector dimension is 100, the (maximum)
context window size is 10, and the algorithm makes 20 passes through the
dataset for learning. The total number of tokens processed by the
algorithm is 3.2M. As is typical for a highly specific domain, the
vocabulary is relatively small, containing about 44k terms, of which
about half are used only once.
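With text output (`-binary 0`), the word2vec tool writes a header line with the vocabulary size and the vector dimension, followed by one line per word holding the word and its vector components. A minimal sketch for reading such output and computing the vector length (L2 norm) of each word is given below; the function names and the three sample lines are made up for illustration.

```python
import math

def read_vectors(lines):
    """Parse word2vec text output: a 'vocab_size dim' header line followed
    by lines of the form 'word x1 x2 ... x_dim'."""
    it = iter(lines)
    _vocab_size, dim = map(int, next(it).split())
    vectors = {}
    for line in it:
        parts = line.split()
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def vector_length(vec):
    """L2 norm of a vector."""
    return math.sqrt(sum(x * x for x in vec))

# Illustrative sample, not real output from the hep-th run.
sample = ["2 2", "black 3.0 4.0 ", "the 0.6 0.8 "]
vectors = read_vectors(sample)

if __name__ == "__main__":
    for word, vec in vectors.items():
        print(word, round(vector_length(vec), 2))
```

The same `read_vectors` function accepts an open file object, since it only iterates over lines.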


^7 The code is available for download at
https://code.google.com/p/word2vec

^8 Specifically, the parameters used are: word2vec
-train $inputfile -output $outputfile
-cbow 0 -size 100 -window 10 -negative 5
-hs 0 -sample 1e-4 -threads 40 -binary 0
-iter 20 -min-count 1



Figure 1: Cosine similarity between arbitrarily 
chosen pairs of word vectors with tf > 1. 

3.1 Similarity Distribution 

During training, similar words are grouped together in the same
direction by the learning algorithm, so that after training the vectors
encode word semantics. One of the most popular measures of semantic
similarity in NLP is the cosine similarity given by the dot product
between two normalized vectors. Being the cosine of the angle between
the two vectors, the cosine similarity can take values in the interval
[-1, 1].
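Written out for two word vectors, with the symbols $\mathbf{u}$, $\mathbf{w}$, and $\theta$ introduced here only for illustration, the definition reads:

```latex
\cos\theta
  = \frac{\mathbf{u}\cdot\mathbf{w}}{\lVert\mathbf{u}\rVert\,\lVert\mathbf{w}\rVert}
  = \hat{\mathbf{u}}\cdot\hat{\mathbf{w}},
\qquad
\hat{\mathbf{u}} = \frac{\mathbf{u}}{\lVert\mathbf{u}\rVert},
\qquad
-1 \le \cos\theta \le 1 .
```

Rescaling either vector by a positive factor leaves $\cos\theta$ unchanged, so this measure depends only on the directions of the vectors and not on their lengths.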

To analyze the hep-th corpus, we built a histogram of the cosine
similarity between arbitrarily chosen pairs of word vectors. The words
are randomly selected from the vocabulary irrespective of their
frequency. We have, however, discarded terms that appear only once. The
result, given in Fig. 1, is a bell-shaped distribution. To our surprise,
the distribution is not centered around zero, but around a positive
value, 0.23. This means that word vectors in the hep-th corpus have on
average a certain similarity. Closely related to this is that the
average word vector is non-zero, having a small length, v = 1.37
(v = 1.51 when words that only appear once are excluded). This vector
marks the center of the word cloud spanned in the word vector space by
all the words in the vocabulary.
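The two quantities discussed here, the histogram of cosine similarities between randomly chosen word pairs and the length of the average word vector, can be sketched as follows. The function names, the number of sampled pairs, the bin width, and the three toy vectors are all assumptions made for illustration; the paper does not state how many pairs were sampled.

```python
import math
import random
from collections import Counter

def cosine(u, w):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, w))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in w)))

def similarity_histogram(vectors, n_pairs=1000, bin_width=0.1, seed=0):
    """Histogram (bin index -> count) of cosine similarities between
    randomly drawn pairs of distinct words."""
    rng = random.Random(seed)
    words = list(vectors)
    hist = Counter()
    for _ in range(n_pairs):
        a, b = rng.sample(words, 2)
        hist[math.floor(cosine(vectors[a], vectors[b]) / bin_width)] += 1
    return hist

def mean_vector_length(vectors):
    """Length of the average of all word vectors."""
    dim = len(next(iter(vectors.values())))
    n = len(vectors)
    mean = [sum(vec[i] for vec in vectors.values()) / n for i in range(dim)]
    return math.sqrt(sum(x * x for x in mean))

# Toy vectors, for illustration only.
toy = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}

if __name__ == "__main__":
    print(sorted(similarity_histogram(toy, n_pairs=100).items()))
    print(mean_vector_length(toy))
```

For a vocabulary whose vectors point in all directions evenly, the mean vector would be close to zero; a non-zero mean goes together with a histogram shifted toward positive similarities.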

To see if this behavior is shared by general purpose corpora, we
considered Wikipedia by way of example.^9 For that corpus, covering
diverse topics, we found, as expected, the histogram to be centered
around zero (and slightly right skewed). The non-zero value found for
the hep-th corpus is therefore probably a sign of the homogeneity of
this dataset.

The reason for excluding terms that only appear once, which after all
make up half of the vocabulary, is that they have their own bell-shaped
distribution, slightly more peaked than the one shown in Fig. 1 and
centered at a higher cosine similarity of about 0.5.

^9 For a cleaned version of the Wikipedia corpus from October 2013, see
https://blog.lateral.io/2015/06/the-unknown-perils-of-mining-Wikipedia/

month           v      tf
january       4.16     16
february      4.19     15
march         4.90     37
april         4.07     13
may           2.23   2229
june          5.95     73
july          5.54     54
august        5.10     31
september     5.54     51
october       3.83     11
november      4.35     15
december      4.39     17

Table 2: The months of the year with their word vector length v and term
frequency tf.

Note the outlier at zero cosine similarity in the distribution in
Fig. 1. The reason for this outlier eludes us.

3.2 Vector Length as Significance Factor 

To demonstrate that, besides depending on consistent use, the length of
a word vector also depends on term frequency, we consider the months of
the year, see Table 2. Apart from the word token “may”, which in
addition denotes a verbal auxiliary and is therefore used in many
different contexts, these terms consistently appear in the abstracts of
the hep-th corpus to indicate the time of a school or conference where
the paper was presented. The data clearly show that for fixed context,
the vector length increases with term frequency, see Fig. 2.

The three terms with the largest vector length are, besides “june”,
“school” (v = 5.97, tf = 114) and “conference” (v = 5.89, tf = 93). If
we take vector length as a measure of word significance, this finding
surprisingly supports another thesis by Luhn (Luhn, 1958), which states
that:

    The more often certain words are found in each other's company
    within a sentence, the more significance may be attributed to each
    of these words.

Here, the phrase “certain words” refers to words that

    a writer normally repeats [...] as he advances or varies his
    arguments and as he elaborates on an aspect of a subject.

Figure 2: Word vector length as a function of frequency of appearance of
the months of the year excluding “may”. The line through the data points
serves as a guide to the eye.

Table 2 also nicely demonstrates that term frequency alone does not
determine the length of a word vector. The term “may” has a much higher
frequency than the other terms in the table, yet it is represented by
the shortest vector. This is because it is used in the corpus mostly as
a verbal auxiliary in opposing contexts. When a word appears in
different contexts, its vector gets moved in different directions during
updates. The final vector then represents some sort of weighted average
over the various contexts. Averaging over vectors that point in
different directions typically results in a vector that gets shorter
with increasing number of different contexts in which the word appears.
For words to be used in many different contexts, they must carry little
meaning. Prime examples of such insignificant words are high-frequency
stop words, which are indeed represented by short vectors despite their
high term frequencies, see Table 3.
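The shortening effect of averaging can be illustrated with a toy calculation. The sketch below averages two-dimensional unit vectors at given angles; it is not the word2vec update rule, only a demonstration of the geometric point that an average of vectors pointing in different directions is shorter than an average of nearly aligned ones. The function name and the chosen angles are illustrative.

```python
import math

def mean_length(angles_deg):
    """Length of the average of 2-D unit vectors at the given angles."""
    n = len(angles_deg)
    x = sum(math.cos(math.radians(a)) for a in angles_deg) / n
    y = sum(math.sin(math.radians(a)) for a in angles_deg) / n
    return math.hypot(x, y)

if __name__ == "__main__":
    # Nearly aligned "contexts" versus widely spread ones.
    print(mean_length([0, 10, -10]))
    print(mean_length([0, 90, -90]))
    print(mean_length([0, 120, 240]))
```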

3.3 Vector Length vs. Term Frequency 

To study to what extent term frequency and word vector length can serve
as indicators of a word’s significance, we represent all words in the
vocabulary in a two-dimensional scatter plot using these variables as
coordinates. Figure 3 gives the result for the hep-th corpus. For given
term frequency, the vector length is seen to take values only in a
narrow interval. That interval initially shifts upwards with increasing
frequency. Around a frequency of about 30, that trend reverses and the
interval shifts downwards.






term       v        tf
the      1.49    257866
of       1.51    148549
         1.43    131022
         1.41     84595
in       1.59     80056
a        1.61     72959
and      1.51     71170
to       1.61     53265
we       1.88     49756
is       1.69     49446
for      1.62     34970
)        2.03     28878


Table 3: Top 12 terms in the term frequency list of 
the hep-th corpus with their word vector length v 
and term frequency tf. In addition to punctuation 
marks, this list exclusively features stop words. 


Both forces determining the length of a word vector are seen at work
here. Small-frequency words tend to be used consistently, so that the
more frequently such words appear, the longer their vectors. This
tendency is reflected by the upwards trend in Fig. 3 at low frequencies.
High-frequency words, on the other hand, tend to be used in many
different contexts, the more so, the more frequently they occur. The
averaging over an increasing number of different contexts shortens the
vectors representing such words. This tendency is clearly reflected by
the downwards trend in Fig. 3 at high frequencies, culminating in
punctuation marks and stop words with short vectors at the very end.

Words represented by the longest vectors in a given frequency bin often
carry the content of distinctive contexts. Typically, these contexts are
topic-wise not at the core of the corpus but more on the outskirts. For
example, the words with the longest vector in the high-frequency ranges
$[2^{k-1}, 2^{k} - 1]$ with k = 10, 11, 12, “inflation” (v = 4.64,
tf = 571), “sitter” (v = 3.81, tf = 1490) as in “de Sitter”, and “holes”
(v = 3.41, tf = 2465) as in “black holes”, all refer to general
relativity. General relativity, having its own subject class (gr-qc) in
the arXiv, is not one of the main subjects of the hep-th corpus. It
takes a distinctive position in this corpus as it mostly appears in
studies that aim at reconciling general relativity with the laws of
quantum mechanics.
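Selecting the longest vector per frequency bin can be sketched as below. The sketch assumes power-of-two bins in which bin k holds the frequencies from 2^(k-1) to 2^k - 1, so that k is simply the bit length of tf. The function name is illustrative, and the sample holds only a handful of (v, tf) values quoted in this report, so the result is the longest vector among these few words rather than over the full vocabulary.

```python
def longest_per_bin(words):
    """words maps term -> (v, tf). Returns {k: (term, v, tf)} holding, for
    each bin k with 2**(k-1) <= tf <= 2**k - 1, the term with largest v."""
    best = {}
    for term, (v, tf) in words.items():
        k = tf.bit_length()  # bin index under the stated convention
        if k not in best or v > best[k][1]:
            best[k] = (term, v, tf)
    return best

# A few (v, tf) values quoted in the text and in Table 1.
sample = {
    "inflation": (4.64, 571),
    "sitter": (3.81, 1490),
    "holes": (3.41, 2465),
    "black": (3.24, 6011),
    "space": (2.18, 8035),
}

if __name__ == "__main__":
    for k, (term, v, tf) in sorted(longest_per_bin(sample).items()):
        print(k, term, v, tf)
```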



Figure 3: Word vector length v versus term frequency tf of all words in
the hep-th vocabulary. Note the logarithmic scale used on the frequency
axis. The dark symbols denote bin means with the kth bin containing the
frequencies in the interval $[2^{k-1}, 2^{k} - 1]$ with k = 1, 2, 3, ....
These means are included as a guide to the eye. The horizontal line
indicates the length v = 1.37 of the mean vector.

Figure 4: Word vector length v versus term frequency tf of all words in
the hep-th vocabulary labeled nouns (red dots) or adjectives (blue
dots).


3.4 POS Tagging 


To further assess the ability of word vector length to measure word
significance, we assign part-of-speech (POS) tags to each word in the
corpus. For this task, we use the Stanford POS tagger^10 (Toutanova et
al., 2003). The final tag assigned to a word in the vocabulary is
decided by majority vote.
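The majority vote can be sketched as follows, taking as input a list of (word, tag) pairs such as a POS tagger would produce for the running text. The function name and the sample pairs are made up for illustration, and how ties are broken is an assumption (here, the tag seen first among equally frequent tags wins).

```python
from collections import Counter, defaultdict

def majority_tags(tagged_tokens):
    """Assign to each word the POS tag it received most often.

    tagged_tokens: iterable of (word, tag) pairs from a tagged corpus.
    Ties go to the tag encountered first (an assumption of this sketch).
    """
    votes = defaultdict(Counter)
    for word, tag in tagged_tokens:
        votes[word][tag] += 1
    return {word: counts.most_common(1)[0][0] for word, counts in votes.items()}

# Illustrative tagged tokens, not actual tagger output.
tagged = [("show", "VBP"), ("show", "VBP"), ("show", "NN"),
          ("may", "MD"), ("may", "MD"), ("may", "NNP")]

if __name__ == "__main__":
    print(majority_tags(tagged))
```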

By the way that word2vec learns word representations, we expect nouns
(excluding proper nouns) and adjectives to be similarly distributed in
the v-tf plane. This is indeed what we observe, see Fig. 4. Note that
these word types pervade almost the entire region covered in the full
plot in Fig. 3 by the complete vocabulary.


^10 Specifically, we use the english-caseless-left3words-distsim tagger.
For details on this model and for downloading, see
http://nlp.stanford.edu/software/tagger.shtml

Figure 5: Word vector length v versus term frequency tf of all words in
the hep-th vocabulary labeled verbs (red dots) or function words (blue
dots).


Figure 6: Word vector length v versus term frequency tf of all words in
the hep-th vocabulary labeled proper nouns (red dots) or function words
(blue dots).


We also find verbs and adverbs to be similarly distributed in the v-tf
plane. Again this was to be expected given that word2vec learns word
representations from word co-occurrences. Somewhat surprisingly, we
observe in Fig. 5 that the distribution of verbs also overlaps with that
of function words.^11 These word types no longer pervade the entire
region covered in the full plot, but are confined to the bottom band,
corresponding to short vectors. The fact that function words are
represented by short vectors underscores the ability of vector length to
measure word significance.

This proficiency is brought out even more clearly by comparing the
distribution of function words and proper nouns, which typically are
indicative of distinctive contexts in the hep-th corpus. The plot in
Fig. 6 shows a clear separation of proper nouns and function words for
sufficiently large term frequencies.

3.5 Visualization 

These results suggest an interesting technique for visualizing text
corpora. By labeling the data points in the v-tf scatter plot with their
terms, one obtains a two-dimensional visualization of all the words in
the vocabulary. One advantage of a v-tf plot is that words are ranked by
significance, thus allowing for effective exploration of a corpus. To
deal with the large number of data points, an interactive visualization
tool can be built that allows the user to navigate a mouse pointer over
the plot, and that shows only the labels of the data points in the
vicinity of the pointer.

^11 As function words we classify: prepositions (IN), pronouns (PRP,
PRP$, WP, WP$), determiners (DT, PDT, WDT), conjunctions (CC), modal
verbs (MD), and particles (RP). In brackets, we included here the tags
used by the Stanford POS tagger.
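The core of such an interactive tool, deciding which labels to show near the pointer, can be sketched as below. This is only one possible design: the function name, the radius, and the choice to measure distance in the (log10 tf, v) plane, matching a logarithmic frequency axis, are assumptions of this sketch rather than a description of an existing tool. The sample points use (tf, v) values quoted in this report.

```python
import math

def labels_near(points, pointer, radius):
    """Return the terms whose data points lie within `radius` of the pointer.

    points:  dict term -> (tf, v)
    pointer: (tf, v) position of the mouse pointer in data coordinates
    Distance is Euclidean in the (log10 tf, v) plane (an assumption).
    """
    px, py = math.log10(pointer[0]), pointer[1]
    near = []
    for term, (tf, v) in points.items():
        if math.hypot(math.log10(tf) - px, v - py) <= radius:
            near.append(term)
    return sorted(near)

points = {
    "school": (114, 5.97),
    "conference": (93, 5.89),
    "june": (73, 5.95),
    "the": (257866, 1.49),
}

if __name__ == "__main__":
    print(labels_near(points, pointer=(100, 5.9), radius=0.2))
```

In a real tool the two axes would likely need separate scaling so that the neighbourhood looks circular on screen.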

There exist other techniques for visualizing high-dimensional data, such
as the popular t-distributed stochastic neighbor embedding (t-SNE) (van
der Maaten and Hinton, 2008). That machine learning algorithm, being an
example of multidimensional scaling, projects high-dimensional data
points onto a plane such that the distances, or similarities, between
them are preserved as well as possible. Words of similar meaning thus
tend to be projected together by the t-SNE algorithm. Since the cosine
similarity is independent of vector lengths, word significance is
ignored when using this measure. The t-SNE algorithm therefore arranges
the data points entirely differently from our proposal. Moreover, in
contrast to the axes in the v-tf scatter plot, those in the t-SNE plot
have no direct meaning.


4 Discussion 

Most applications of distributed representations of words obtained
through word2vec have so far centered around semantics. A host of
experiments have demonstrated the extent to which the direction of word
vectors captures semantics. In this brief report, it was pointed out
that not only the direction, but also the length of word vectors carries
important information. Specifically, it was shown that word vector
length furnishes, in combination with term frequency, a useful measure
of word significance. Also, an alternative to the t-SNE algorithm for
projecting word vectors onto a plane was introduced, where words are
ordered by significance rather than similarity.

We have restricted ourselves to unigrams in this exploratory study. For
more extended experiments and applications, including important bi- and
trigrams in the vocabulary will certainly improve results. We have also
restricted ourselves to running word2vec using parameters that were
recommended by the developers, and have not attempted to optimize them.

Finally, the question arises whether word vectors produced by other
highly scalable machine learning algorithms built on top of word
co-occurrences, such as the log bilinear model (Mnih and Kavukcuoglu,
2013) and GloVe (Pennington et al., 2014), also encode word significance
in their length.